BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages
نویسندگان
چکیده
We present BPEmb, a collection of pre-trained subword unit embeddings in 275 languages, based on Byte-Pair Encoding (BPE). In an evaluation using fine-grained entity typing as testbed, BPEmb performs competitively, and for some languages better than alternative subword approaches, while requiring vastly fewer resources and no tokenization. BPEmb is available at https://github.com/bheinzerling/bpemb.
منابع مشابه
DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis
In this paper we present two deep-learning systems that competed at SemEval-2017 Task 4 “Sentiment Analysis in Twitter”. We participated in all subtasks for English tweets, involving message-level and topic-based sentiment polarity classification and quantification. We use Long Short-Term Memory (LSTM) networks augmented with two kinds of attention mechanisms, on top of word embeddings pre-trai...
متن کاملBetter Word Embeddings for Korean
Vector representations of words that capture semantic and syntactic information accurately is critical for the performance of models that use these vectors as inputs. Algorithms that only use the surrounding context at the word level ignore the subword level relationships which carry important meaning especially for languages that are highly inflected such as Korean. In this paper we compare th...
متن کاملQuotient Complexity of Bifix-, Factor-, and Subword-Free Regular Languages
A language L is prefix-free if, whenever words u and v are in L and u is a prefix of v, then u = v. Suffix-, factor-, and subword-free languages are defined similarly, where “subword” means “subsequence”. A language is bifix-free if it is both prefixand suffix-free. We study the quotient complexity, more commonly known as state complexity, of operations in the classes of bifix-, factor-, and su...
متن کاملQLUT at SemEval-2017 Task 1: Semantic Textual Similarity Based on Word Embeddings
This paper reports the details of our submissions in the task 1 of SemEval 2017. This task aims at assessing the semantic textual similarity of two sentences or texts. We submit three unsupervised systems based on word embeddings. The differences between these runs are the various preprocessing on evaluation data. The best performance of these systems on the evaluation of Pearson correlation is...
متن کاملAdapting Pre-trained Word Embeddings For Use In Medical Coding
Word embeddings are a crucial component in modern NLP. Pre-trained embeddings released by different groups have been a major reason for their popularity. However, they are trained on generic corpora, which limits their direct use for domain specific tasks. In this paper, we propose a method to add task specific information to pre-trained word embeddings. Such information can improve their utili...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- CoRR
دوره abs/1710.02187 شماره
صفحات -
تاریخ انتشار 2017